Introduction to Scikit-Learn

View this IPython Notebook:

j.mp/sklearn

Everything is in a GitHub repo:

github.com/tdhopper/

View slides with:

ipython nbconvert Intro\ to\ Scikit-Learn.ipynb --to slides --post serve
# Introduction to Scikit-Learn

__Research Triangle Analysts (1/16/13)__

Software Engineer at [parse.ly](http://www.parse.ly)
@tdhopper
tdhopper@gmail.com

What is Scikit-Learn?

"Machine Learning in Python"

  • Classification
  • Regression
  • Clustering
  • Dimensionality Reduction
  • Model Selection
  • Preprocessing

See more: http://scikit-learn.org/stable/user_guide.html

Why scikit-learn?

Six reasons why Ben Lorica (@bigdata) recommends scikit-learn

One: Commitment to documentation and usability

One of the reasons I started using scikit-learn was because of its nice documentation (which I hold up as an example for other communities and projects to emulate).

Six reasons why Ben Lorica (@bigdata) recommends scikit-learn

Two: Models are chosen and implemented by a dedicated team of experts

Scikit-learn’s stable of contributors includes experts in machine-learning and software development.

Six reasons why Ben Lorica (@bigdata) recommends scikit-learn

Three: Covers most machine-learning tasks

Scan the list of things available in scikit-learn and you quickly realize that it includes tools for many of the standard machine-learning tasks (such as clustering, classification, regression, etc.).

Six reasons why Ben Lorica (@bigdata) recommends scikit-learn

Four: Python and Pydata

An impressive set of Python data tools (pydata) have emerged over the last few years.

Six reasons why Ben Lorica (@bigdata) recommends scikit-learn

Five: Focus

Scikit-learn is a machine-learning library. Its goal is to provide a set of common algorithms to Python users through a consistent interface.

Six reasons why Ben Lorica (@bigdata) recommends scikit-learn

Six: scikit-learn scales to most data problems

Many problems can be tackled using a single (big memory) server, and well-designed software that runs on a single machine can blow away distributed systems.

This talk is not...

...an introduction to Python

...an introduction to machine learning

Example


In [4]:
from sklearn import datasets
from numpy import logical_or
from sklearn.lda import LDA
from sklearn.metrics import confusion_matrix

In [5]:
iris = datasets.load_iris()
subset = logical_or(iris.target == 0, iris.target == 1)

X = iris.data[subset]
y = iris.target[subset]

In [6]:
print X[0:5,:]


[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]

In [7]:
print y[0:5]


[0 0 0 0 0]

In [8]:
# Linear Discriminant Analysis
lda = LDA(2)
lda.fit(X, y)

confusion_matrix(y, lda.predict(X))


Out[8]:
array([[50,  0],
       [ 0, 50]])

The Scikit-learn API

The main "interfaces" in scikit-learn are (one class can implement multiple interfaces):

Estimator:

estimator = obj.fit(data, targets) 

Predictor:

prediction = obj.predict(data) 

Transformer:

new_data = obj.transform(data) 

Model:

score = obj.score(data)
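
To tie these roles together, here is a minimal sketch (reusing the iris X and y loaded in the example above); a single object often plays several roles at once:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()      # an Estimator
clf.fit(X, y)                   # fit learns parameters from the training data
labels = clf.predict(X)         # a Predictor: infer labels for data
probs = clf.predict_proba(X)    # probabilistic predictions, where supported
acc = clf.score(X, y)           # a Model: mean accuracy on the given data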

Scikit-learn API: the Estimator

All estimators implement the fit method:

estimator.fit(X, y)

An estimator is an object that fits a model based on some training data and is capable of inferring some properties on new data.


In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
# Create Model
model = LogisticRegression()
# Fit Model
model.fit(X, y)


Out[10]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

(Almost) everything is an estimator

Unsupervised Learning


In [11]:
from sklearn.cluster import KMeans

In [12]:
# Create Model
kmeans = KMeans(n_clusters = 2)
# Fit Model
kmeans.fit(X)


Out[12]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=0)

Dimensionality Reduction


In [13]:
from sklearn.decomposition import PCA

In [14]:
# Create Model 
pca = PCA(n_components=2)
# Fit Model
pca.fit(X)


Out[14]:
PCA(copy=True, n_components=2, whiten=False)

The fit method accepts a $y$ parameter even when one isn't needed (in that case, $y$ is simply ignored). This becomes important later, when we chain estimators in pipelines.


In [15]:
from sklearn.decomposition import PCA

In [16]:
pca = PCA(n_components=2)
pca.fit(X, y)


Out[16]:
PCA(copy=True, n_components=2, whiten=False)

Feature Selection


In [17]:
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import matthews_corrcoef

In [18]:
# Create Model
kbest = SelectKBest(k = 1)
# Fit Model
kbest.fit(X, y)


Out[18]:
SelectKBest(k=1, score_func=<function f_classif at 0x1139f3398>)

(Almost) everything is an estimator!


In [83]:
model = LogisticRegression()
model.fit(X, y)

kbest = SelectKBest(k = 1)
kbest.fit(X, y)

kmeans = KMeans(n_clusters = 2)
kmeans.fit(X, y)

pca = PCA(n_components=2)
pca.fit(X, y)


Out[83]:
PCA(copy=True, n_components=2, whiten=False)

What can we do with an estimator?

Inference!


In [19]:
model = LogisticRegression()
model.fit(X, y)
print model.coef_


[[-0.40731745 -1.46092371  2.24004724  1.00841492]]

In [20]:
kmeans = KMeans(n_clusters = 2)
kmeans.fit(X)
print kmeans.cluster_centers_


[[ 5.936  2.77   4.26   1.326]
 [ 5.006  3.418  1.464  0.244]]

In [21]:
pca = PCA(n_components=2)
pca.fit(X, y)
print pca.explained_variance_


[ 2.73946394  0.22599044]

In [22]:
kbest = SelectKBest(k = 1)
kbest.fit(X, y)
print kbest.get_support()


[False False  True False]

Is that it?

Scikit-learn API: the Predictor


In [23]:
model = LogisticRegression()
model.fit(X, y)

X_test = [[ 5.006,  3.418,  1.464,  0.244], [ 5.936,  2.77 ,  4.26 ,  1.326]]

model.predict(X_test)


Out[23]:
array([0, 1])

In [24]:
print model.predict_proba(X_test)


[[ 0.97741151  0.02258849]
 [ 0.01544837  0.98455163]]

Scikit-learn API: the Transformer


In [25]:
pca = PCA(n_components=2)
pca.fit(X)

print pca.transform(X)[0:5,:]


[[-1.65441341 -0.20660719]
 [-1.63509488  0.2988347 ]
 [-1.82037547  0.27141696]
 [-1.66207305  0.43021683]
 [-1.70358916 -0.21574051]]

fit_transform is also available (and is sometimes faster).


In [54]:
pca = PCA(n_components=2)
print pca.fit_transform(X)[0:5,:]


[[-1.65441341 -0.20660719]
 [-1.63509488  0.2988347 ]
 [-1.82037547  0.27141696]
 [-1.66207305  0.43021683]
 [-1.70358916 -0.21574051]]

In [26]:
kbest = SelectKBest(k = 1)
kbest.fit(X, y)

print kbest.transform(X)[0:5,:]


[[ 1.4]
 [ 1.4]
 [ 1.3]
 [ 1.5]
 [ 1.4]]

Scikit-learn API: the Model


In [27]:
from sklearn.cross_validation import KFold
from numpy import arange
from random import shuffle
from sklearn.dummy import DummyClassifier

In [86]:
model = DummyClassifier()
model.fit(X, y)

model.score(X, y)


Out[86]:
0.48999999999999999
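
The score interface is what cross-validation helpers call fold by fold. A minimal sketch using the KFold splitter imported above together with cross_val_score (both live in sklearn.cross_validation in this version of scikit-learn); the 5-fold setup is an arbitrary choice for illustration:

from sklearn.cross_validation import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
folds = KFold(len(y), n_folds = 5, shuffle = True)   # split indices into 5 train/test folds
scores = cross_val_score(model, X, y, cv = folds)    # fits and scores the model once per fold
print scores.mean()                                  # average held-out accuracy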

Building Pipelines


In [87]:
from sklearn.pipeline import Pipeline

In [55]:
pipe = Pipeline([
          ("select", SelectKBest(k = 3)),
          ("pca", PCA(n_components = 1)),
          ("classify", LogisticRegression())
          ])

pipe.fit(X, y)

pipe.predict(X)


Out[55]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1])

Intermediate steps of the pipeline must be Estimators and Transformers.

The final step need only be an Estimator.
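
Conceptually, fitting a pipeline runs fit_transform on each intermediate step, passing the transformed data along, and then fits the final step; predicting transforms with each intermediate step and predicts with the last. A toy, hand-rolled sketch of that idea (not scikit-learn's actual implementation):

def pipeline_fit(steps, X, y):
    for name, step in steps[:-1]:       # intermediate steps: Estimator + Transformer
        X = step.fit_transform(X, y)    # this is why fit always accepts y
    steps[-1][1].fit(X, y)              # final step: only fit is required

def pipeline_predict(steps, X):
    for name, step in steps[:-1]:
        X = step.transform(X)
    return steps[-1][1].predict(X)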

Text Pipeline


In [78]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

In [71]:
news = fetch_20newsgroups()
data = news.data
category = news.target

In [72]:
len(data)


Out[72]:
11314

In [92]:
print "  ".join(news.target_names)


alt.atheism  comp.graphics  comp.os.ms-windows.misc  comp.sys.ibm.pc.hardware  comp.sys.mac.hardware  comp.windows.x  misc.forsale  rec.autos  rec.motorcycles  rec.sport.baseball  rec.sport.hockey  sci.crypt  sci.electronics  sci.med  sci.space  soc.religion.christian  talk.politics.guns  talk.politics.mideast  talk.politics.misc  talk.religion.misc

In [99]:
print data[8]


From: holmes7000@iscsvax.uni.edu
Subject: WIn 3.0 ICON HELP PLEASE!
Organization: University of Northern Iowa
Lines: 10

I have win 3.0 and downloaded several icons and BMP's but I can't figure out
how to change the "wallpaper" or use the icons.  Any help would be appreciated.


Thanx,

-Brando

PS Please E-mail me



In [100]:
pipe = Pipeline([
    ('vect', CountVectorizer(max_features = 100)),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

pipe.fit(data, category)


Out[100]:
Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=100, min_df=1,
        ngram_range=(1, 1), prepr..., penalty='l2', power_t=0.5,
       random_state=None, shuffle=False, verbose=0, warm_start=False))])
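
Once fit, the same pipeline vectorizes and classifies raw strings directly. A small sketch (the example document is made up):

docs_new = ["I am selling my motorcycle, low mileage"]   # hypothetical new document
predicted = pipe.predict(docs_new)
print news.target_names[predicted[0]]                     # map the class index back to a newsgroup name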

Pandas Pipelines!


In [107]:
import pandas as pd
import numpy as np
import sklearn.preprocessing, sklearn.decomposition, sklearn.linear_model, sklearn.pipeline, sklearn.metrics
from sklearn_pandas import DataFrameMapper, cross_val_score

In [117]:
data = pd.DataFrame({
    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary':   [90, 24, 44, 27, 32, 59, 36, 27]
})

In [111]:
mapper = DataFrameMapper([
     ('pet', sklearn.preprocessing.LabelBinarizer()),
     ('children', sklearn.preprocessing.StandardScaler()),
     ('salary', None)
])

In [113]:
mapper.fit_transform(data)


Out[113]:
array([[  1.        ,   0.        ,   0.        ,   0.20851441,  90.        ],
       [  0.        ,   1.        ,   0.        ,   1.87662973,  24.        ],
       [  0.        ,   1.        ,   0.        ,  -0.62554324,  44.        ],
       [  0.        ,   0.        ,   1.        ,  -0.62554324,  27.        ],
       [  1.        ,   0.        ,   0.        ,  -1.4596009 ,  32.        ],
       [  0.        ,   1.        ,   0.        ,  -0.62554324,  59.        ],
       [  1.        ,   0.        ,   0.        ,   1.04257207,  36.        ],
       [  0.        ,   0.        ,   1.        ,   0.20851441,  27.        ]])

In [157]:
mapper = DataFrameMapper([
     ('pet', sklearn.preprocessing.LabelBinarizer()),
     ('children', sklearn.preprocessing.StandardScaler()),
     ('salary', None)
])

pipe = Pipeline([
    ("mapper", mapper),
    ("pca", PCA(n_components=2))
])
pipe.fit_transform(data) # 'data' is a data frame, not a numpy array!


Out[157]:
array([[ -4.76269151e+01,   4.25991055e-01],
       [  1.83856756e+01,   1.86178138e+00],
       [ -1.62747544e+00,  -5.06199939e-01],
       [  1.53796381e+01,  -8.10331853e-01],
       [  1.03575109e+01,  -1.52528125e+00],
       [ -1.66260441e+01,  -4.27845667e-01],
       [  6.37295205e+00,   9.68066902e-01],
       [  1.53846579e+01,   1.38193738e-02]])

Pandas pipelines require the sklearn-pandas module by @paulgb.


Model Evaluation and Selection


In [212]:
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

In [137]:
# Create sample dataset
X, y = datasets.make_classification(n_samples = 1000, n_features = 40, n_informative = 6, n_classes = 2)

In [162]:
# Pipeline for Feature Selection to Random Forest
pipe = Pipeline([
  ("select", SelectKBest()),
  ("classify", RandomForestClassifier())
])

In [175]:
# Define parameter grid
param_grid = {
  "select__k" : [1, 6, 20, 40],
  "classify__n_estimators" : [1, 10, 100],
  
}
gs = GridSearchCV(pipe, param_grid)

In [183]:
# Search over grid
gs.fit(X, y)

gs.best_params_


Out[183]:
{'classify__n_estimators': 10, 'select__k': 6}

In [192]:
print gs.best_estimator_.predict(X.mean(axis = 0))


[1]

Curse of Dimensionality

Search space grows exponentially with number of parameters.
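
The number of settings to evaluate is the product of the lengths of the parameter lists. A quick count for the grid above (GridSearchCV in this version defaults to 3-fold cross validation):

param_grid = {
  "select__k" : [1, 6, 20, 40],
  "classify__n_estimators" : [1, 10, 100],
}
n_settings = 1
for values in param_grid.values():
    n_settings *= len(values)
print n_settings        # 4 * 3 = 12 parameter settings
print n_settings * 3    # x 3 CV folds = 36 model fits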


In [185]:
gs.grid_scores_


Out[185]:
[mean: 0.72600, std: 0.02773, params: {'classify__n_estimators': 1, 'select__k': 1},
 mean: 0.78200, std: 0.00631, params: {'classify__n_estimators': 1, 'select__k': 6},
 mean: 0.74400, std: 0.02580, params: {'classify__n_estimators': 1, 'select__k': 20},
 mean: 0.70600, std: 0.05772, params: {'classify__n_estimators': 1, 'select__k': 40},
 mean: 0.73800, std: 0.02372, params: {'classify__n_estimators': 10, 'select__k': 1},
 mean: 0.90000, std: 0.01539, params: {'classify__n_estimators': 10, 'select__k': 6},
 mean: 0.86400, std: 0.01047, params: {'classify__n_estimators': 10, 'select__k': 20},
 mean: 0.81200, std: 0.02247, params: {'classify__n_estimators': 10, 'select__k': 40},
 mean: 0.73600, std: 0.02229, params: {'classify__n_estimators': 100, 'select__k': 1},
 mean: 0.89200, std: 0.01520, params: {'classify__n_estimators': 100, 'select__k': 6},
 mean: 0.89000, std: 0.01769, params: {'classify__n_estimators': 100, 'select__k': 20},
 mean: 0.87000, std: 0.02366, params: {'classify__n_estimators': 100, 'select__k': 40}]

Curse of Dimensionality: Parallelization

GridSearch on 1 core:


In [207]:
param_grid = {
  "select__k" : [1, 5, 10, 15, 20, 25, 30, 35, 40],
  "classify__n_estimators" : [1, 5, 10, 25, 50, 75, 100],
  
}
gs = GridSearchCV(pipe, param_grid, n_jobs = 1)
%timeit gs.fit(X, y)
print


1 loops, best of 3: 6.31 s per loop

GridSearch on 7 cores:


In [208]:
gs = GridSearchCV(pipe, param_grid, n_jobs = 7)
%timeit gs.fit(X, y)
print


1 loops, best of 3: 1.81 s per loop

Curse of Dimensionality: Randomization

GridSearchCV might be very slow:


In [220]:
param_grid = {
  "select__k" : range(1, 40),
  "classify__n_estimators" : range(1, 100), 
}

In [221]:
gs = GridSearchCV(pipe, param_grid, n_jobs = 7)
gs.fit(X, y)
print "Best CV score", gs.best_score_
print gs.best_params_


Best CV score 0.924
{'classify__n_estimators': 59, 'select__k': 9}

We can instead randomly sample from the parameter space with RandomizedSearchCV:


In [229]:
gs = RandomizedSearchCV(pipe, param_grid, n_jobs = 7, n_iter = 10)
gs.fit(X, y)
print "Best CV score", gs.best_score_
print gs.best_params_


Best CV score 0.894
{'classify__n_estimators': 58, 'select__k': 7}

Conclusions

  • Scikit-learn has an elegant API and is built in a beautiful language.
  • Pipelines allow complex chains of operations to be easily computed.
    • This helps ensure correct cross validation (see Elements of Statistical Learning 7.10.2 and the sketch below).
  • Pipelines combined with grid search permit easy model selection.
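
As a sketch of the cross-validation point: if feature selection is run on the full dataset before cross-validating, the selector has already seen the held-out folds; putting it inside the Pipeline refits the selection within each training fold. Illustration only, reusing the make_classification X and y from the model-selection section:

from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Leaky: the selector sees every fold's data before cross validation
X_reduced = SelectKBest(k = 6).fit_transform(X, y)
leaky_scores = cross_val_score(LogisticRegression(), X_reduced, y)

# Correct: the selector is refit on each training fold only
pipe = Pipeline([
  ("select", SelectKBest(k = 6)),
  ("classify", LogisticRegression())
])
honest_scores = cross_val_score(pipe, X, y)

print leaky_scores.mean(), honest_scores.mean()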